In this project, we’re examining inspection scores for restaurants in and around Austin, Texas, from 2014 through early 2017. Scores are determined through routine inspections, usually conducted twice a year. If an inspection records a score lower than 70, the restaurant is subject to one or more follow-up visits to ensure a return to compliance. All of our restaurant inspection records include geographical information in the form of zip codes, allowing the use of additional datasets to further explore the data. Visit https://data.world/kvaughn/s-17-dv-project-6 to explore our datasets, or see below for the visualizations we’ve made of the data.
Below we display our sessionInfo().
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X Yosemite 10.10.5
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] readr_1.1.0 data.world_0.1.2 dplyr_0.5.0 quantreg_5.33 SparseM_1.77
[6] plotly_4.6.0 ggplot2_2.2.1.9000 leaflet_1.1.0 DT_0.2 shinydashboard_0.5.3
[11] shiny_1.0.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.10 RColorBrewer_1.1-2 plyr_1.8.4 bitops_1.0-6 base64enc_0.1-3
[6] tools_3.3.2 digest_0.6.12 lattice_0.20-34 viridisLite_0.2.0 jsonlite_1.3
[11] evaluate_0.10 tibble_1.3.0 gtable_0.2.0 Matrix_1.2-7.1 DBI_0.5-1
[16] crosstalk_1.0.0 curl_2.3 yaml_2.1.14 stringr_1.1.0 httr_1.2.1
[21] knitr_1.15.1 hms_0.3 MatrixModels_0.4-1 htmlwidgets_0.8 rprojroot_1.2
[26] grid_3.3.2 R6_2.2.0 rmarkdown_1.3 RJSONIO_1.3-0 purrr_0.2.2
[31] tidyr_0.6.1 magrittr_1.5 backports_1.0.5 scales_0.4.1 htmltools_0.3.5
[36] rsconnect_0.7 assertthat_0.1 mime_0.5 xtable_1.8-2 colorspace_1.3-2
[41] httpuv_1.3.3 stringi_1.1.2 RCurl_1.95-4.8 lazyeval_0.2.0 munsell_0.4.3
The restaurant inspection dataset was provided through the City of Austin’s Open Data Portal, https://data.austintexas.gov/. This free download is available in multiple formats at https://data.austintexas.gov/Health/Restaurant-Inspection-Scores/uthw-a7ih. For this project, we downloaded a CSV file and cleaned it using the Extract-Translate-Load procedure outlined below.
The dataset we received needed some cleaning, as it was unreadable to Tableau in its original CSV formatting. We used the Extract-Translate-Load method discussed in class and built an R script, found at “../01 Data/ETLRestaurant.R”. We also cleaned up the join query used in our Tableau visualizations; it can be found at “../01 Data/ETLQuery.R”. If you’d like to follow along, here’s how we processed the restaurant inspection data:
First, download the uncleaned CSV: https://data.austintexas.gov/api/views/ecmv-9xxi/rows.csv?accessType=DOWNLOAD. Prepend “PreETL_” to the file name, and put it in a folder called CSVs in the same directory as this project (s17dvproject5-vaughn-cannata-martinez).
Note that we have one column of numeric data (“Score”), one column of dates (“Inspection Date”), three columns of strings (“Restaurant Name”, “Address”, and “Process Description”), and two columns of numeric data used categorically (“Zip Code” and “Facility ID”). For ease of use, these last two columns will be treated as strings.
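As a minimal sketch of that decision (toy values drawn from the str() output below; not the full dataset):

```r
# Coerce the two categorical numeric columns to character so they aren't
# treated as quantities. Toy values for illustration only.
df_toy <- data.frame(`Zip Code` = c(78759L, 78701L),
                     `Facility ID` = c(10658566L, 2803420L),
                     check.names = FALSE)
df_toy[["Zip Code"]]    <- as.character(df_toy[["Zip Code"]])
df_toy[["Facility ID"]] <- as.character(df_toy[["Facility ID"]])
str(df_toy)  # both columns are now chr
```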
# Set working directory
setwd("~/s17dvfinalproject-vaughn-cannata-martinez-1/00 Docs")
# Set filepath
file_path = "../../CSVs/PreETL_Restaurant_Inspection_Scores.csv"
# Create dataframe from original CSV
df <- read.csv(file_path, stringsAsFactors = FALSE)
# Standardize column names
names(df) <- gsub("\\.+", " ", names(df))
# Print column names and types
str(df)
'data.frame': 25072 obs. of 7 variables:
$ Restaurant Name : chr "Savory Spice Shop" "Sam's Club #6453 Produce & Groceries" "Seoul Food" "Seoul Food" ...
$ Zip Code : int 78759 78759 78759 78759 78759 78759 78759 78759 78759 78759 ...
$ Inspection Date : chr "10/20/16" "11/18/14" "10/18/16" "10/22/15" ...
$ Score : int 100 100 78 92 100 85 92 97 97 77 ...
$ Address : chr "10000 RESEARCH BLVD SB Unit 129\nAUSTIN, TX 78759\n(30.392352, -97.746845)" "9700 N CAPITAL OF TEXAS HWY SB\nAUSTIN, TX 78759\n(30.390516, -97.742262)" "8650 SPICEWOOD SPRINGS RD\nAUSTIN, TX 78759\n(30.433565, -97.771812)" "8650 SPICEWOOD SPRINGS RD\nAUSTIN, TX 78759\n(30.433565, -97.771812)" ...
$ Facility ID : int 10658566 2803420 10955767 10955767 2803921 2800291 2803420 2803920 10408633 10004376 ...
$ Process Description: chr "Routine Inspection" "Routine Inspection" "Routine Inspection" "Routine Inspection" ...
We’ve grouped the date column separately from measures and dimensions; while Tableau will recognize these data as dates, it seems prudent to reclassify them into a standardized format.
# Select string data as 'dimensions'
dimensions <- c("Restaurant Name", "Zip Code", "Address", "Process Description", "Facility ID")
# Select dates as 'dates'
dates <- c("Inspection Date")
# Select all remaining data as 'measures'
measures <- setdiff(names(df), union(dimensions, dates))
Here we remove special characters from all of our data. These characters serve no purpose in the data and can cause problems during analysis. To improve readability, special characters are replaced with spaces.
# Get rid of special characters in each column.
for(n in names(df)) {
df[n] <- data.frame(lapply(df[n], gsub, pattern="[^ -~]",replacement= " "), stringsAsFactors = FALSE)
}
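For intuition, the pattern `[^ -~]` matches any character outside the printable ASCII range (space through tilde), so accented letters and control characters alike become spaces:

```r
# Each non-printable-ASCII character (here \u00e9 and the tab) becomes a space.
gsub("[^ -~]", " ", "Caf\u00e9\tAustin")  # "Caf  Austin"
```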
Now we go through all of the character-based columns (grouped as dimensions) and remove troublesome characters. Ampersands are replaced with “and”, semicolons are replaced with colons, and single- and double-quotes are removed outright. Data of NA (null) is replaced with an empty string.
# This function will replace NA data with an empty string
na2emptyString <- function (x) {
x[is.na(x)] <- ""
return(x)
}
# We'll apply this to all columns grouped as dimensions
if( length(dimensions) > 0) {
for(d in dimensions) {
# Change NA to the empty string.
df[d] <- data.frame(lapply(df[d], na2emptyString), stringsAsFactors = FALSE)
# Get rid of " and ' in dimensions.
df[d] <- data.frame(lapply(df[d], gsub, pattern="[\"']",replacement= ""), stringsAsFactors = FALSE)
# Change & to and in dimensions.
df[d] <- data.frame(lapply(df[d], gsub, pattern="&",replacement= " and "), stringsAsFactors = FALSE)
# Change : to ; in dimensions.
df[d] <- data.frame(lapply(df[d], gsub, pattern=":",replacement= ";"), stringsAsFactors = FALSE)
}
}
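A quick worked example of those substitutions, applied in the same order to a made-up restaurant name:

```r
x <- "Sam's \"Club\" & Groceries: Deli"  # made-up example string
x <- gsub("[\"']", "", x)                # drop single and double quotes
x <- gsub("&", " and ", x)               # & -> and
x <- gsub(":", ";", x)                   # : -> ;
x  # "Sams Club  and  Groceries; Deli"
```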
Here we ensure that dates are formatted correctly.
if (length(dates) > 0) {
for(y in dates) {
# Format as dates
df[y] <- data.frame(lapply(df[y], function(y) as.Date(y, format = "%m/%d/%y")), stringsAsFactors = FALSE)
}
}
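For example, the two-digit-year format string parses the raw inspection dates like this:

```r
# %m/%d/%y interprets two-digit years; "10/20/16" becomes the Date 2016-10-20.
as.Date("10/20/16", format = "%m/%d/%y")
```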
Now we take all of the non-numeric characters out of our measures.
na2zero <- function (x) {
x[is.na(x)] <- 0
return(x)
}
# Get rid of all characters in measures except numbers, the - sign, and the period, and change NA to 0.
if (length(measures) > 0) {
for(m in measures) {
df[m] <- data.frame(lapply(df[m], gsub, pattern = "[^-.0-9]", replacement = ""), stringsAsFactors = FALSE)
df[m] <- data.frame(lapply(df[m], na2zero), stringsAsFactors = FALSE)
df[m] <- data.frame(lapply(df[m], function(m) as.numeric(as.character(m)))) # This is needed to turn measures back to numeric because gsub turns them into strings.
}
}
A sanity check is always useful. Let’s look at our data and see what we’ve got. We’re expecting all columns to be in character form except “Score”, which will be numeric, and our date column, “Inspection Date”.
# Take a look and make sure it's what you think it should be:
print(summary(df))
Restaurant Name Zip Code Inspection Date Score Address
Length:25072 Length:25072 Min. :2014-03-31 Min. : 36.00 Length:25072
Class :character Class :character 1st Qu.:2015-02-04 1st Qu.: 87.00 Class :character
Mode :character Mode :character Median :2015-10-10 Median : 93.00 Mode :character
Mean :2015-10-13 Mean : 91.36
3rd Qu.:2016-07-08 3rd Qu.: 97.00
Max. :2017-03-10 Max. :100.00
Facility ID Process Description
Length:25072 Length:25072
Class :character Class :character
Mode :character Mode :character
Our data looks good, so we write it to a new, clean file.
# Now write it to a new clean file.
write.csv(df, gsub("PreETL_", "", file_path), row.names=FALSE, na = "")
Our Tableau file is available at https://github.com/CannataUTDV/s17dvfinalproject-vaughn-cannata-martinez-1/01_Data/Final.twb
As can be seen in the visualization below, the most common raw restaurant inspection score is between 90 and 100. The 90-95 and 95-100 bins are essentially the same size. The second most common ranges are 85-90 and 100-105. All of the results in the 100-105 bin are in fact exactly 100, indicating that “perfect” scores are very much achievable under the Austin restaurant inspection rubric.
The next largest bin is half the size of the second largest, and this pattern continues until we drop below 70. This is particularly good news: it implies that failing scores (those below 70) are quite rare, while exceptional scores are very common.
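The real chart was built in Tableau, but the 5-point binning can be sketched in base R; the scores below are the ten sample values from the str() output above, not the full dataset:

```r
# Bin a handful of sample scores into the same 5-point buckets as the chart.
scores <- c(100, 100, 78, 92, 100, 85, 92, 97, 97, 77)
bins <- cut(scores, breaks = seq(35, 105, by = 5), right = FALSE)
table(bins)  # counts per bin; the three 100s land in [100,105)
```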
Here we see a comprehensive visualization that reflects data as we’ve seen it before, in our “interesting things” section. This scatterplot shows restaurant and transit-stop density per zip code, with each point sized by that zip code’s population. Ultimately, we see an overall positive trend. However, upon closer examination, it is clear that many high-population zip codes have a lower restaurant and stop density.
The implications of this have already been discussed in previous sections. However, it is also interesting to note that many zip codes, regardless of population size, are grouped toward the low-to-medium restaurant and stop density area. In fact, the further out in either direction you go, the fewer instances there are. This suggests that while there is a slight positive correlation between stop and restaurant density, most areas are not likely to have an exceptional number of either.
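That correlation claim can be checked in the same spirit with cor(); the per-zip-code counts here are hypothetical stand-ins, not the actual joined dataset:

```r
# Made-up per-zip-code counts, used only to illustrate the correlation check.
zips <- data.frame(restaurants = c(12, 40, 55, 48, 20),
                   stops       = c(30, 90, 130, 105, 55))
cor(zips$restaurants, zips$stops)  # positive, consistent with the trend above
```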
The screenshots and descriptions below are based on our interactive visualization app, accessible at https://kvaughn.shinyapps.io/restaurant_inspection/.
One low score does not indicate a systemic problem with a restaurant’s cleanliness; before judging a restaurant with a low score, it behooves us to consider the other scores it has received. This boxplot allows the user to do just that. First select a cutoff point for scores, then choose a zip code to see all restaurants that have scored below that level at least once. Each restaurant’s scores will display, giving the user an idea of whether the low score was a fluke or a sign of deeper problems.
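Under the hood, the filtering step can be sketched like this (toy data; the actual app works on the cleaned inspection data frame):

```r
# Keep every score for any restaurant whose minimum score falls below the cutoff.
cutoff <- 85
toy <- data.frame(name  = c("A", "A", "B", "B"),
                  score = c(95, 70, 98, 92))
low_min <- ave(toy$score, toy$name, FUN = min) < cutoff
toy[low_min, ]  # all of restaurant A's scores; its minimum (70) is below 85
```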
This visualization allows the user to get a broader view of the restaurant inspection scores. The user is given the option to select scores from any combination of years. What’s interesting here is that the vast majority of restaurant inspections result in “passing” grades of 70 or higher.
The Scatterplot tab includes three separate visualizations drawn from one set of data. The first visualization plots the population of a zip code against the number of restaurants in that zip code. The second graph plots the population per zip code against the quantity of transit stops in that zip code. (Transit stop location information is provided by Capital Metro and made available through data.texas.gov: https://data.texas.gov/dataset/GTFS/r4v4-vz24 ) As you can see, the correlations between population and restaurant or transit-stop quantity are fairly weak. The third visualization in this set plots transit stops against restaurants, and shows a much stronger correlation. See above for a discussion of one possible explanation for this pattern of relationships.
Click here to be taken to our published Shiny app: https://kvaughn.shinyapps.io/restaurant_inspection/
This visualization starts with a somewhat arbitrary KPI, the “Safety Index,” that takes into account both the average score and the lowest inspection score received for a given zip code in a given year. A high Safety Index indicates both a high average score and a high minimum score. Set the parameters however you like (3 and 8 are reasonable defaults) and check the crosstab to see each zip code’s ratings over the 2014–2017 period.
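One plausible form for such an index is a weighted blend of the average and minimum score per zip code. To be clear, the formula below and its weights (3 and 8, echoing the app’s default parameters) are assumptions for illustration, not the app’s actual code:

```r
# Hypothetical Safety Index: weighted blend of average and minimum score
# per zip code, using toy scores for a single year.
toy <- data.frame(zip   = c("78701", "78701", "78759", "78759"),
                  score = c(95, 70, 98, 92))
avg_score <- tapply(toy$score, toy$zip, mean)
min_score <- tapply(toy$score, toy$zip, min)
safety_index <- (3 * avg_score + 8 * min_score) / (3 + 8)
round(safety_index, 1)  # one consistently clean zip scores much higher
```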
This visualization allows the user to set an arbitrary cutoff point as an aid for identifying low inspection scores. The default value, 70, is the cutoff for restaurant inspections; lower scores will trigger a reinspection. After selecting a parameter using the slider and requesting data, the user can select the “Crosstab” tab to see which zip codes and years had unacceptably low inspection scores.
Austin is a fast-growing city, and our restaurant scene is currently booming as well. This visualization allows a user to see where the density of restaurants compared to population is highest. Using a public population dataset generated by the US Census, we can find the number of people per restaurant in a given zip code; as more restaurants are added (or fail), the people-per-restaurant number changes. The default values highlight the density of restaurants in 78701, which is downtown Austin.
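The people-per-restaurant figure is a simple ratio of census population to restaurant count per zip code. The counts below are illustrative, not the actual census or inspection figures:

```r
# Population divided by restaurant count, per zip code (made-up numbers).
population  <- c("78701" = 6800,  "78759" = 40000)
restaurants <- c("78701" = 400,   "78759" = 250)
people_per_restaurant <- population / restaurants
people_per_restaurant  # a low ratio, as in downtown 78701, means dense dining
```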
These visualizations are also accessible at https://kvaughn.shinyapps.io/restaurant_inspection/
Bar charts can be a good way to quickly identify trends visually. In this visualization, the user can create a set of zip codes of interest, then see each zip code’s average score per year over the three to four years covered by the dataset.
Capital Metro, the Austin-area transit service, has made its data freely available for public download. By importing this data into geographic information software (ArcGIS) and overlaying the coordinates of transit stops on a shapefile of regional zip codes, we were able to count the number of transit stops per zip code. In this bar chart, we’re examining the relationship between transit density and restaurant density. Use the slider to select zip codes with more than the minimum number of transit stops, then compare the number of restaurants in these zip codes to the average number of restaurants per zip code throughout the region. The blue line is the calculated average for all of the selected zip codes, and the red line is the grand average for all of the zip codes in the Greater Austin area.
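The bar chart’s comparison logic can be sketched with made-up per-zip counts (the actual stop counts came from the ArcGIS overlay described above):

```r
# Filter zip codes by a minimum number of transit stops, then compare the
# selected-zip average (blue line) with the grand average (red line).
zips <- data.frame(zip         = c("78701", "78702", "78703", "78759"),
                   stops       = c(150, 90, 40, 110),
                   restaurants = c(400, 180, 60, 250))
min_stops <- 100
sel <- zips[zips$stops >= min_stops, ]
mean(sel$restaurants)   # blue line: average over selected zip codes
mean(zips$restaurants)  # red line: grand average over all zip codes
```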
Would we expect zip codes with comparatively large numbers of nonnative (naturalized-citizen or noncitizen) residents to have more or fewer restaurants, on average, than zip codes with lower percentages of nonnative residents? This bar chart allows the user to select a minimum percentage of nonnative residents per zip code, then shows the number of restaurants in that zip code and the average number of restaurants per zip code for all of the qualifying zip codes. This average can be compared to the grand average throughout the greater Austin area, shown in red.